ggplot2 library¶Introduction to R is brought to you by the Centre for the Analysis of Genome Evolution & Function (CAGEF) bioinformatics training initiative. This course was developed based on feedback on the needs and interests of the Department of Cell & Systems Biology and the Department of Ecology and Evolutionary Biology.
The structure of this course is a code-along style - it is 100% hands on! A few hours prior to each lecture, links to the materials will be avaialable for download at QUERCUS. The teaching materials will consist of a Jupyter Lab Notebook with concepts, comments, instructions, and blank spaces that you will fill out with R by coding along with the instructor. Other teaching materials include an HTML version of the notebook, and datasets to import into R - when required. This learning approach will allow you to spend the time coding and not taking notes!
As we go along, there will be some in-class challenge questions for you to solve either individually or in cooperation with your peers. Post lecture assessments will also be available (see syllabus for grading scheme and percentages of the final mark) through DataCamp to help cement and/or extend what you learn each week.
We'll take a blank slate approach here to R and assume that you pretty much know nothing about programming. From the beginning of this course to the end, we want to get you from some potential scenarios:
A pile of data (like an excel file or tab-separated file) full of experimental observations and you don't know what to do with it.
Maybe you're manipulating large tables all in excel, making custom formulas and pivot table with graphs. Now you have to repeat similar experiments and do the analysis again.
You're generating high-throughput data and there aren't any bioinformaticians around to help you sort it out.
You heard about R and what it could do for your data analysis but don't know what that means or where to start.
and get you to a point where you can:
Format your data correctly for analysis
Produce basic plots and perform exploratory analysis
Make functions and scripts for re-analysing existing or new data sets
Track your experiments in a digital notebook like Jupyter!
In the first two lessons, we will talk about the basic data structures and objects in R, get cozy with the RStudio environment, and learn how to get help when you are stuck. Because everyone gets stuck - a lot! Then you will learn how to get your data in and out of R, how to tidy our data (data wrangling), subset and merge data, and generate descriptive statistics. Next will be data cleaning and string manipulation; this is really the battleground of coding - getting your data into the format where you can analyse it. After that, we will make all sorts of plots for both data exploration and publication. Lastly, we will learn to write customized functions and apply more advanced statistical tests, which really can save you time and help scale up your analyses.
The structure of the class is a code-along style: It is fully hands on. At the end of each lecture, the complete notes will be made available in a PDF format through the corresponding Quercus module so you don't have to spend your attention on taking notes.
This is the fourth in a series of seven lectures. Last lecture we finished up with basic manipulation of data frames with the help of the tidyr package. This week we are taking a break to enjoy the fruits of our labours. Now that we can make properly formatted data frames, we can use these objects as input to produce beautiful, publication-quality data visualizations with the help of the ggplot2 package. This week our topics are broken into:
ggplot and the grammar of graphics using scatterplotsGrey background: Command-line code, R library and function names... fill in the code here if you are coding alongEach week, new lesson files will appear within your JupyterHub folders. We are pulling from a GitHub repository using this Repository git-pull link. Simply click on the link and it will take you to the University of Toronto JupyterHub. You will need to use your UTORid credentials to complete the login process. From there you will find each week's lecture files in the directory /2021-09-IntroR/Lecture_XX. You will find a partially coded skeleton.ipynb file as well as all of the data files necessary to run the week's lecture.
Alternatively, you can download the Jupyter Notebook (.ipynb) and data files from JupyterHub to your personal computer if you would like to run independently of the JupyterHub.
A live lecture version will be available at camok.github.io that will update as the lecture progresses. Be sure to refresh to take a look if you get lost!
As mentioned above, at the end of each lecture there will be a completed version of the lecture code released as a PDF file under the Modules section of Quercus. A recorded version of the lecture will be made available through the University's MyMedia website and a link will be posted in the Discussion section of Quercus.
Sequencing of the V3-V5 hypervariable regions of the 16S rRNA gene
16S rRNA gene amplicon sequencing of 30 latrines from Tanzania and Vietnam at different depths (multiples of 20cm). Microbial abundance is represented in Operational Taxonomic Units (OTUs). Operational Taxonomic Units (OTUs) are groups of organisms defined by a specified level of DNA sequence similarity at a marker gene (e.g. 97% similarity at the V4 hypervariable region of the 16S rRNA gene). Intrinsic environmental factors such as pH, temperature, organic matter composition were also recorded.
We have 3 csv files.
A metadata file (Naming conventions: [Country_LatrineNo_Depth]) with sample names and environmental variables.
An OTU abundance table containing the numbers of operational taxonomic units sampled from various sites.
A wide-format data set of the pit latrine data containing the Taxa abundance information across various sample sites.
B Torondel, JHJ Ensink, O Gunvirusdu, UZ Ijaz, J Parkhill, F Abdelahi, V-A Nguyen, S Sudgen, W Gibson, AW Walker, and C Quince. Assessment of the influence of intrinsic environmental and geographical factors on the bacterial ecology of pit latrines Microbial Biotechnology, 9(2):209-223, 2016. DOI:10.1111/1751-7915.12334
The following packages are used in this lesson:
tidyverse (tidyverse installs several packages for you, like dplyr, readr, readxl, tibble, and ggplot2)RColorBrewer contains a series of different colour palettesviridis contains alternative colour-blind friendly colour palettesggbeeswarm a package to help visualized grouped datapoints in a sensible wayggthemes a source for alternative plot themesggpubr used to generate multi-plot figures for publication gridExtra works with ggpubr to produce multi-plot figuresUpSetR an alternative visualization package to classic Venn diagrams
ggrepel used to avoid text overlap (See Appendix)
Some of these packages should already be installed into your Anaconda base from previous lectures. If not, please review that lesson and load these packages. Remember to please install these packages from the conda-forge channel of Anaconda.
#--------- Install packages to for today's session ----------#
# install.packages("tidyverse", dependencies = TRUE) # This package should already be installed on Jupyter Hub
# None of these packages are already available on JupyterHub
# install.packages("ggbeeswarm", dependencies = TRUE)
# install.packages("ggpubr", dependencies = TRUE)
# install.packages("UpSetR", dependencies = TRUE)
#--------- Load packages to for today's session ----------#
library(tidyverse)
library(ggbeeswarm)
library(RColorBrewer)
library(viridis)
library(ggthemes)
library(ggpubr)
library(UpSetR)
Warning message: "package 'tidyverse' was built under R version 4.0.5" -- Attaching packages --------------------------------------- tidyverse 1.3.1 -- v ggplot2 3.3.3 v purrr 0.3.4 v tibble 3.1.1 v dplyr 1.0.6 v tidyr 1.1.3 v stringr 1.4.0 v readr 1.4.0 v forcats 0.5.1 Warning message: "package 'ggplot2' was built under R version 4.0.5" Warning message: "package 'tibble' was built under R version 4.0.5" Warning message: "package 'tidyr' was built under R version 4.0.5" Warning message: "package 'dplyr' was built under R version 4.0.5" Warning message: "package 'forcats' was built under R version 4.0.5" -- Conflicts ------------------------------------------ tidyverse_conflicts() -- x dplyr::filter() masks stats::filter() x dplyr::lag() masks stats::lag() Loading required package: viridisLite
One approach to effective data visualization relies on the Grammar of Graphics framework originally proposed by Leland Wilkinson (2005).The idea of grammar can be summarized as follows:
The grammar of graphics is a language to communicate about what we are plotting programmatically
It begins with a tidy data frame. It will have a series of observations (rows) each of which will be described across multiple variables (columns). Variables can actually represent qualitative or quantitative measurements or they could be descriptive data about the experiments or experimental groups.
The data units may undergo conversion through a process called scaling (transformation) before being used for plotting.
A subset of data columns are then passed on to be presented in various data plots (scatterplots, boxplots, kernel density estimates, etc.) by using the data to describe visual properties of the plot. We call these visual properties, the aesthetics of the plot. For example, the data being plotted or represented can be visually altered in shape or colour based on accompanying column data.
A plot can have multiple layers (for example, a scatter plot with a regression line) and each of these plot types is referred to as a geom (short of geometric object).
ggplot2¶The grammar of graphics facilitates the concise description of any components of any graphics. Hadly Wickham of tidyverse fame has proposed a variant on this concept - the layered grammar of graphics framework. By following a layered approach of defined components, it can be easy to build a visualization. ggplot2 was made to interact well with tidy (long) datasets. If you are spending lots of time figuring out how to make a scatterplot, your data may not be in the correct format.
The Major Components of the Grammar of Graphics by Dipanjan Sarkar
We can break down the above pyramid by the base components, building from the base upwards.
Data: your visualization always starts here. What are the dimensions you want to visualize. What aspect of your data are you trying to convey?Aesthetics: assign your axes based on the data dimensions you have chosen. Where will the majority of the data fall on your plot? Are there other dimensions (such as categorically encoded groupings) that can be conveyed by aspects like size, shape, colour, fill, etc. Scale: do you need to scale/transform any values to fit your data within a range? This includes layers that map between the data and the aesthetics.Geometric objects: how will you display your data within your visualization. Which geom_* will you use?Statistics: are there additional summary statistics that should be included in the visualization? Some examples include central tendency, spread, confidence intervals, standard error, etc.Facets: will generating subplot of the data add a dimension to our visualization that would otherwise by lost?Coordinate system: will your visualization follow a classis cartesian, semi-log, polar, etc. coordinate system?Let's jump into our first dataset and start building some plots with it shall we?
Let's build our first plot step by step to learn more about how ggplot2 works. We will begin by loading datasets from our 16S rRNA V3-V5 hypervariable region sequencing.
Let's read our data tables in and also load ggplot2. We will be using a variety of packages today that are grouped in tidyverse, so let's load tidyverse. You can see in the package startup message that ggplot2 is one of the attached packages.
# Load the tidyverse package
# library(tidyverse)
metadata <- read_csv("data/metadata_pitlatrine.csv",
col_types = 'cccnnnn') # Here we are explicitly specifying our column types
# Take a look at the metadata structure
print("metadata structure")
str(metadata)
[1] "metadata structure" spec_tbl_df[,7] [81 x 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame) $ Country : chr [1:81] "Tanzania" "Tanzania" "Tanzania" "Tanzania" ... $ Latrine_Number: chr [1:81] "2" "2" "2" "2" ... $ Depth : chr [1:81] "1" "10" "12" "2" ... $ pH : num [1:81] 7.82 9.08 8.84 6.49 6.46 7.69 7.48 7.6 7.55 7.68 ... $ Temp : num [1:81] 25.1 24.2 25.1 29.6 27.9 28.7 29.8 25 28.8 28.9 ... $ TS : num [1:81] 14.5 37.8 71.1 13.9 29.4 ... $ CODt : num [1:81] 874 102 35 389 161 57 107 62 384 372 ... - attr(*, "spec")= .. cols( .. Country = col_character(), .. Latrine_Number = col_character(), .. Depth = col_character(), .. pH = col_number(), .. Temp = col_number(), .. TS = col_number(), .. CODt = col_number() .. )
ggplot object needs data¶We're going to build this first plot layer by layer and that begins with specifying the data source. In this case, let's use metadata to start off our plot. When we see it print, you'll find that there's nothing much displayed as output.
# Initialize our ggplot object with some data
# 1. Data
ggplot(data = metadata)
ggplot object consists of many parameters¶We have, however, created a ggplot object. This is a list of 9 parameters:
Luckily there are some defaults, so we don't have to specify everything, but you can start to see how ggplot objects are highly customizable. So far, we have only specified the data aspect of this object.
# Let's take a quick look at structure of a ggplot object
str(ggplot(metadata))
List of 9 $ data : spec_tbl_df[,7] [81 x 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame) ..$ Country : chr [1:81] "Tanzania" "Tanzania" "Tanzania" "Tanzania" ... ..$ Latrine_Number: chr [1:81] "2" "2" "2" "2" ... ..$ Depth : chr [1:81] "1" "10" "12" "2" ... ..$ pH : num [1:81] 7.82 9.08 8.84 6.49 6.46 7.69 7.48 7.6 7.55 7.68 ... ..$ Temp : num [1:81] 25.1 24.2 25.1 29.6 27.9 28.7 29.8 25 28.8 28.9 ... ..$ TS : num [1:81] 14.5 37.8 71.1 13.9 29.4 ... ..$ CODt : num [1:81] 874 102 35 389 161 57 107 62 384 372 ... ..- attr(*, "spec")= .. .. cols( .. .. Country = col_character(), .. .. Latrine_Number = col_character(), .. .. Depth = col_character(), .. .. pH = col_number(), .. .. Temp = col_number(), .. .. TS = col_number(), .. .. CODt = col_number() .. .. ) $ layers : list() $ scales :Classes 'ScalesList', 'ggproto', 'gg' <ggproto object: Class ScalesList, gg> add: function clone: function find: function get_scales: function has_scale: function input: function n: function non_position_scales: function scales: NULL super: <ggproto object: Class ScalesList, gg> $ mapping : Named list() ..- attr(*, "class")= chr "uneval" $ theme : list() $ coordinates:Classes 'CoordCartesian', 'Coord', 'ggproto', 'gg' <ggproto object: Class CoordCartesian, Coord, gg> aspect: function backtransform_range: function clip: on default: TRUE distance: function expand: TRUE is_free: function is_linear: function labels: function limits: list modify_scales: function range: function render_axis_h: function render_axis_v: function render_bg: function render_fg: function setup_data: function setup_layout: function setup_panel_guides: function setup_panel_params: function setup_params: function train_panel_guides: function transform: function super: <ggproto object: Class CoordCartesian, Coord, gg> $ facet :Classes 'FacetNull', 'Facet', 'ggproto', 'gg' <ggproto object: Class FacetNull, Facet, gg> compute_layout: function draw_back: function draw_front: function draw_labels: function draw_panels: function finish_data: function init_scales: function map_data: function params: list setup_data: function setup_params: function shrink: TRUE train_scales: function vars: function super: <ggproto object: Class FacetNull, Facet, gg> $ plot_env :<environment: R_GlobalEnv> $ labels : Named list() - attr(*, "class")= chr [1:2] "gg" "ggplot"
aes() determines attributes of the mapping list and how data is displayed¶The next step is to choose the data we are plotting (aesthetics) and how it influences the visualization. At this point the data can be scaled directly and the axes appear. We have not yet specified how we want the data plotted, only which data should be plotted. In practice, people usually omit 'mapping = ', but it is a good reminder that mapping is, in fact, what we are doing.
When we start customizing our plot, our code starts to get a bit harder to read on one line. We can create each specification on a new line by ending each line with a +.
For our plot, we'll specify the x and y axis using data from the TS (total solids) and CODt (total chemical oxygen demand) variables.
# Add the aes() parameter to our plot
ggplot(data = metadata, mapping = aes(x = TS, y = CODt))
# We can make it equivalent by adding aes() like a layer
# 1. Data
ggplot(metadata) +
# 2. Aesthetics
aes(x = TS, y = CODt)
geom_point()¶We now have to choose the geometric object (geom) with which to plot our data, in this case a point. A geom could be a line, a bar, a boxplot - you can type geom_ and then Tab to see all of the available options. Autocomplete can also be helpful for remembering syntax.
Some helpful geom commands:
| Command | Geom description | Used for |
|---|---|---|
geom_point() |
Single points of data plotted on an x and y axis | scatterplots, dotplots, bubble charts |
geom_bar() |
Barchart summarizing data based with heights proportional to size of its group | barplots and stacked barplots |
geom_col() |
Barchart summarizing data where with heights representing values in the data | barplots of data values? |
geom_boxplot() |
Produce a rough visualization of data distribution | boxplots |
geom_line() |
Track values of multiple groups along an x-axis such as time | line graphs |
geom_jitter() |
When datapoint overlap too much, you can spread them out using jitter | Helpful for boxplots |
geom_violin() |
Combines a kernel distribution estimate in a boxplot-style format | Known as the violin plot |
Since we are making a scatterplot we'll want to go with the geom_point() function.
# Add our data points to the ggplot object
# 1. Data
ggplot(metadata) +
# 2. Aesthetics
aes(x = TS, y = CODt) +
# 3. Scaling
# 4. Geoms
geom_point()
aes() based on factors¶The data looks like there may be two groupings. My guess would be that this is for the 2 different countries. We can easily test this by colouring our points by Country. Look at the structure of metadata in either the Global Environment or using str(). To do this in R, we want to base our colouring on levels from a factor. Afterwards a legend will be automatically created for you.
To accomplish this, we first need to make sure that "Country" is a factor.
print("Our original metadata file")
str(metadata)
[1] "Our original metadata file" spec_tbl_df[,7] [81 x 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame) $ Country : chr [1:81] "Tanzania" "Tanzania" "Tanzania" "Tanzania" ... $ Latrine_Number: chr [1:81] "2" "2" "2" "2" ... $ Depth : chr [1:81] "1" "10" "12" "2" ... $ pH : num [1:81] 7.82 9.08 8.84 6.49 6.46 7.69 7.48 7.6 7.55 7.68 ... $ Temp : num [1:81] 25.1 24.2 25.1 29.6 27.9 28.7 29.8 25 28.8 28.9 ... $ TS : num [1:81] 14.5 37.8 71.1 13.9 29.4 ... $ CODt : num [1:81] 874 102 35 389 161 57 107 62 384 372 ... - attr(*, "spec")= .. cols( .. Country = col_character(), .. Latrine_Number = col_character(), .. Depth = col_character(), .. pH = col_number(), .. Temp = col_number(), .. TS = col_number(), .. CODt = col_number() .. )
# We can directly replace the Country variable with a factor instead of a character
metadata$Country <- factor(metadata$Country, levels=unique(metadata$Country))
print("Our updated metadata file")
str(metadata)
[1] "Our updated metadata file" spec_tbl_df[,7] [81 x 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame) $ Country : Factor w/ 2 levels "Tanzania","Vietnam": 1 1 1 1 1 1 1 1 1 1 ... $ Latrine_Number: chr [1:81] "2" "2" "2" "2" ... $ Depth : chr [1:81] "1" "10" "12" "2" ... $ pH : num [1:81] 7.82 9.08 8.84 6.49 6.46 7.69 7.48 7.6 7.55 7.68 ... $ Temp : num [1:81] 25.1 24.2 25.1 29.6 27.9 28.7 29.8 25 28.8 28.9 ... $ TS : num [1:81] 14.5 37.8 71.1 13.9 29.4 ... $ CODt : num [1:81] 874 102 35 389 161 57 107 62 384 372 ... - attr(*, "spec")= .. cols( .. Country = col_character(), .. Latrine_Number = col_character(), .. Depth = col_character(), .. pH = col_number(), .. Temp = col_number(), .. TS = col_number(), .. CODt = col_number() .. )
The aes() parameter of colour can be mapped to Country when we first specify 'x' and 'y' . By now you've noticed that we have been setting specific attributes in an order that matches our diagram of the grammar of graphics pyramid.
# Add our data points to the ggplot object
# 1. Data
ggplot(metadata) +
# 2. Aesthetics
aes(x = TS, y = CODt, colour = Country) +
# 3. Scaling
# 4. Geoms
geom_point()
aes() unless explicitly specified¶When setting the aes() parameter there are generally three ways to do this in order of inheritance
ggplot(data = ... , mapping = (aes(x = ..., y = ..., colour = ...)))(aes(x = ..., y = ..., colour = ...))geom_*(aes(x = ..., y = ..., colour = ...))This means that colour can be specified using geom_point(aes()) since it is a description of the points being plotted. When building a plot, using this command will supercede the plot's default mappings (if any were created and inherited). By placing version 2 into our code, at the beginning of our plot, we are essentially overriding the default mappings, which are nothing. I prefer to write the code this way for easier reading but method 1 is the more formal way of setting a default mapping to your plot.
It is rare that you might use 3 but not impossible. Especially when layering multiple geom_*() objects, you may find that you want them coloured in one way, but shaped or sized based on a different factor. Setting the default mappings at the start, takes away having to input this information into each new layer of your ggplot object.
#is equivalent in final output to but subsequent layers won't inherit this! compare the consequences
# 1. Data
ggplot(metadata) +
# 2. Aesthetics
aes(x = TS, y = CODt) +
# 3. Scaling
# 4. Geoms
geom_point(aes(colour = Country))
scale_y_*()¶Some of our data points seem to be crushed near the x-axis. We can scale the y-axis to fix this. There are a number of ways to specify how the axes of our graph can be scaled. This is usually accomplished through the commands
scale_y_*() and scale_x_*() where * denotes a number of options in R including
discrete continuouslog10Within these commands we can further specify the axis name, limits (start and end), breaks (tick mark locations), labels for each break, and axis (not data) transformations. In this case, lets keep it simple and log-transform our y-axis.
# Convert the y-axis to a log10 scale
# 1. Data
ggplot(metadata) +
# 2. Aesthetics
aes(x = TS, y = CODt, colour = Country) +
# 3. Scaling
scale_y_log10() +
# 4. Geoms
geom_point()
aes()¶Keep in mind that scaling does not change the data, but rather the representation of the data. The y-axis has been scaled. This is different than taking the log10 of the y data.
Can we transform our data directly? Yes, by manipulating the data in our specification of the y-axis data itself in our aes() call but take a close look at the resulting y-axis!
# Update the y-axis aesthetic to scale the data directly.
# 1. Data
ggplot(metadata) +
# 2. Aesthetics
aes(x = TS, y = log10(CODt), colour = Country) +
# 3. Scaling
# 4. Geoms
geom_point()
The placement of the points looks the same, but the first graph is scaling the axis while the second graph has changed the data to a log10 scale.
facet_*() to display multiple conditions in separate panels¶Faceting allows us to split our data into groups to display in a more separated fashion. Note that we have removed the colour specification in our groups as splitting the data into separate graphs accomplishes the same distinction.
It is good data visualization practice to only have one attribute (colour, shading, faceting, symbols) per grouping.
# Update our aesthetics and add a facet
# 1. Data
ggplot(metadata) +
# 2. Aesthetics
aes(x = TS, y = CODt) +
# 3. Scaling
scale_y_log10() +
# 4. Geoms
geom_point() +
# 6. Facets
facet_grid(Country ~ .) # use facet_grid to split panels by country
repr¶You may have noticed that when your legends are quite large, you lose some real estate on your actual graph. This is due to how we output the graphs both when saving and in displaying. In Jupyter, the standard output dimension is a 7x7inch graph.
When displaying your graphs in Jupyter you can update options through the repr library to widen or lengthen graphs as you create them with big legends or multiple facets. We'll talk about the process of saving them soon as well. First, let's fix our previous graph.
# Load the repr package to change width
library(repr)
# Adjust our plot window size according to the expected output
options(repr.plot.width=12, repr.plot.height=7)
# Update our aesthetics and add a facet
# 1. Data
ggplot(metadata) +
# 2. Aesthetics
aes(x = TS, y = CODt) +
# 3. Scaling
scale_y_log10() +
# 4. Geoms
geom_point() +
# 6. Facets
facet_grid(. ~ Country) # use facet_grid to split panels by country
I could now add information from another variable as a colour in this plot. Note that if a variable is continuous instead of discrete, the colour will be a gradient.
# Update our aesthetics to colour by Temp
# 1. Data
ggplot(metadata) +
# 2. Aesthetics
aes(x = TS, y = CODt, colour = Temp) + # Update the colour to a continuous variable
# 3. Scaling
scale_y_log10() +
# 4. Geoms
geom_point() +
# 6. Facets
facet_grid(. ~ Country)
aes() parameters¶If we really wanted to we could add another variable to the plot by changing the shape attribute. Let's change Country to having 2 shapes and facet by the Depth of the latrine instead.
Note that shape can only be used for discrete values.
A quick reference key for shapes can be found in the 'Cookbook for R' (http://www.cookbook-r.com/Graphs/Shapes_and_line_types/).
# Change our point shape by Country and facet by Depth
# 1. Data
ggplot(metadata) +
# 2. Aesthetics
aes(x = TS, y = CODt, colour = Temp, shape = Country) + # Set the shape by country
# 3. Scaling
scale_y_log10() +
# 4. Geoms
geom_point() +
# 6. Facets
facet_grid(. ~ Depth)
facet_*()¶Note that up until now we've been using facet_grid(~variable) to split our data by variable. This annotation causes the grids to be distributed horizontally. Other ways to facet by a single variable are:
facet_grid(variable~.) will distribute your grids verticallyfacet_wrap(~variable) will return a symmetrical matrix of plots based on levels in your variable.We can now note that only Tanzania had sampling greater than 4cm of depth. There are single latrines for 4 samples. There was no latrine at a depth of 11cm. Lack of replication and a bias towards Tanzania for the higher depths is something we should keep in mind while looking at this data. Depending on the question we are trying to answer, we may want to remove some of these data.
One thing that is not necessary in this case - but good to know about - is the ability to allow each grid to have its own independent axis scale. In our example, wells of Depths from 1-4cm have up to 1000 CODt, while the other wells barely have values past 100 CODt. This can be changed, but keep in mind most people will assume all grids have the same scale, so take extra care to point out that the scales are different when presenting or publishing.
# Use facet_wrap to rescale our y-axis individually
# 1. Data
ggplot(metadata) +
# 2. Aesthetics
aes(x = TS, y = CODt, colour = Temp, shape = Country) + # Set the shape by country
# 3. Scaling
scale_y_log10() +
# 4. Geoms
geom_point() +
# 6. Facets
facet_wrap(. ~ Depth, scales = "free_y") # switch to a facet_wrap with the y-axis allowed to self-scale
Have you noticed something strange about the past two versions of our graph? When faceting by "Depth", how are our groups being ordered? Why?
# Let's remind ourselves of the metadata structure
glimpse(metadata)
# and fix what might be wrong
metadata$Depth = as.numeric(metadata$Depth)
Rows: 81 Columns: 7 $ Country <fct> Tanzania, Tanzania, Tanzania, Tanzania, Tanzania, Tanza~ $ Latrine_Number <chr> "2", "2", "2", "2", "2", "2", "2", "2", "3", "3", "3", ~ $ Depth <chr> "1", "10", "12", "2", "3", "6", "7", "9", "2", "3", "5"~ $ pH <dbl> 7.82, 9.08, 8.84, 6.49, 6.46, 7.69, 7.48, 7.60, 7.55, 7~ $ Temp <dbl> 25.1, 24.2, 25.1, 29.6, 27.9, 28.7, 29.8, 25.0, 28.8, 2~ $ TS <dbl> 14.53, 37.76, 71.11, 13.91, 29.45, 65.52, 36.03, 46.87,~ $ CODt <dbl> 874, 102, 35, 389, 161, 57, 107, 62, 384, 372, 315, 143~
# Plot our data with the updated metadata object
# 1. Data
ggplot(metadata) +
# 2. Aesthetics
aes(x = TS, y = CODt, colour = Temp, shape = Country) + # Set the shape by country
# 3. Scaling
scale_y_log10() +
# 4. Geoms
geom_point() +
# 6. Facets
facet_wrap(. ~ Depth, scales = "free_y")
You can also add statistical transformations to your plots. Again, take a look at stat_ then Tab to see the list of options. In this case let's separately fit a linear regression line to CODt vs TS for each country. The grey area around the line is the confidence interval (default=0.95) and can be removed with the additional call to stat_smooth of se = FALSE.
# Add our regression line with stat_smooth
# 1. Data
ggplot(metadata) +
# 2. Aesthetics
aes(x = TS, y = CODt) +
# 3. Scaling
scale_y_log10() +
# 4. Geoms
geom_point() +
# 5. statistics
stat_smooth(method = lm) + # add in some regression lines for our data
# 6. Facets
facet_grid(~Country)
`geom_smooth()` using formula 'y ~ x'
alpha parameter to de-emphasize data¶A linear model is not always the best fit. The method of calculating the smoothing function can be changed to other provided functions (such as loess - short for local regression, used below) or can be a custom formula. We'll talk more about making our own models in Lecture 06! Note that I changed the confidence interval by modifying level=0.8.
geoms can be made more transparent with the alpha parameter, which is set to 0.3 in the following code so that the emphasis is on the regression line rather than the points.
# Set the alpha on geom_point and change our regression method
# 1. Data
ggplot(metadata) +
# 2. Aesthetics
aes(x = TS, y = CODt) +
# 3. Scaling
scale_y_log10() +
# 4. Geoms
geom_point(alpha = 0.3) + # Set the alpha level to make our points more transparent
# 5. statistics
# Loess: locally estimated scatter plot smoothing.
# Non-parametric approach that fits multiple regressions in local neighborhood
stat_smooth(method = loess, level = 0.8) +
# 6. Facets
facet_grid(~Country)
`geom_smooth()` using formula 'y ~ x'
Now that we have some of the basics, it's time to take a closer look at using other types of plots. We'll begin by reviewing the latrines dataset after loading it into memory as the variable latrines.
# Load up the taxa_pitlatrine.csv dataset
latrines <- read_csv("data/taxa_pitlatrine.csv", col_types = 'ccici')
print("latrines structure")
str(latrines)
[1] "latrines structure" spec_tbl_df[,5] [4,212 x 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame) $ Taxa : chr [1:4212] "Acidobacteria_Gp1" "Acidobacteria_Gp10" "Acidobacteria_Gp14" "Acidobacteria_Gp16" ... $ Country : chr [1:4212] "Tanzania" "Tanzania" "Tanzania" "Tanzania" ... $ Latrine_Number: int [1:4212] 2 2 2 2 2 2 2 2 2 2 ... $ Depth : chr [1:4212] "1" "1" "1" "1" ... $ OTUs : int [1:4212] 0 0 0 0 0 0 0 0 0 0 ... - attr(*, "spec")= .. cols( .. Taxa = col_character(), .. Country = col_character(), .. Latrine_Number = col_integer(), .. Depth = col_character(), .. OTUs = col_integer() .. )
What is the distribution of OTUs by sample site in each country? We can quickly visualize this property by making a density plot for OTUs by Country. I have set alpha (transparency) to 0.3 so that we can see both countries on our plot.
# Build a density plot of your data
# 1. Data
ggplot(latrines) +
# 2. Aesthetics
aes(x=OTUs, fill = Country) +
# 4. Geoms
geom_density(alpha=0.3)
xlim()¶The first thing to notice is that everything is clumped at 0. This is because we have not filtered our data frame to remove all observations where OTUs are zero. The other thing to notice is that there is a long tail where there will only be a few observations. It will be necessary to change the x-axis to see our data. This is done by setting 'limits' (lower and upper boundaries) on the x-axis with xlim(). Note that R gives us a warning that we are not viewing 158 of our 4212 rows.
Let's do the following:
xlim().geom_rug so that we can see where each value falls along the distribution.# call data and filter "on the fly"
latrines %>%
filter(OTUs >=2) %>% # filter the data before passing it to ggplot
# 1. Data
ggplot(.) +
# 2. Aesthetics
aes(x=OTUs, fill=Country) +
# 3. Scaling
xlim(0, 1000) +
# 4. Geoms
geom_density(alpha = 0.3) + # Move our alpha here because we only want to affect this geom
geom_rug()
# geom_rug adds lines on the desired axis to indicate data points.
# Rug plots display individual cases so are best used with smaller datasets.
Warning message: "Removed 158 rows containing non-finite values (stat_density)."
Unlike density plots, histograms count the number of observations you have in each 'bin' that you specify. So with proper parameters you can recreate a similar shape to your density plots using only the observed data.
# Let's make a filtered_latrines variable to work with
filtered_latrines <- latrines %>% filter(OTUs >=2)
# 1. Data
ggplot(filtered_latrines) +
# 2. Aesthetics
aes(x=OTUs, fill=Country) +
# 3. Scaling
xlim (0, 1000) +
# 4. Geoms
geom_histogram(alpha=0.5) # Change our plot to a histogram
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`. Warning message: "Removed 158 rows containing non-finite values (stat_bin)." Warning message: "Removed 4 rows containing missing values (geom_bar)."
position parameter to alter how data is stacked¶Instead of having the countries information stacked, we may want to see the data side by side. This can be done with the parameter position set to dodge. Let's do the following:
ylim()geom_rug()# Update with dodgine the data, ylim and geom_rug
# 1. Data
ggplot(filtered_latrines) +
# 2. Aesthetics
aes(x=OTUs, fill=Country) +
# 3. Scaling
xlim (0, 1000) +
ylim (0, 350) + # Update the y-axis limits
# 4. Geoms
geom_histogram(binwidth = 50, position = "dodge", alpha=0.5) # dodge our two set of data
Warning message: "Removed 158 rows containing non-finite values (stat_bin)." Warning message: "Removed 2 rows containing missing values (geom_bar)."
Can we create a bar plot of OTUs per country? With geom_bar() and the proper aes() we can fill in colour along the bar to represent specific Taxa.
The default use of geom_bar() is to create a barchart where the height of each bar is proportional to the number of observations (ie rows in Taxa) for a particular group (ie Country). The default argument for this calculation in geom_bar() is stat="bin". We can, however, use the sum of values of a variable by using stat=identity instead but a y variable must be identified.
# What happens if we don't specify an "identity" and y-axis value?
# 1. Data
ggplot(filtered_latrines) +
# 2. Aesthetics
aes(x = Country, fill = Taxa) +
# 4. Geoms
geom_bar()
# Make a bar graph based on OTUs and fill by taxa
# 1. Data
ggplot(filtered_latrines) +
# 2. Aesthetics
aes(x = Country, y = OTUs, fill = Taxa) + # Include a y-axis value for calculating totals
# 4. Geoms
geom_bar(stat = "identity") # Sum OTU values instead of rows per taxa
fct_reorder2()¶In the legend for the bar plot our factor levels for Taxa are in alphabetical order. However, an ordering that might be more useful would be an order that matches our data, for example, Taxa in descending order of OTU abundance.
To do this, we can use the forcats package (included in tidyverse and already loaded). fct_reorder2() takes the factor we want ordered (Taxa) and associates the highest x values (Country) with their y values (OTUs) ordering them in descending order. This is most relevent when working with line graphs but works for our bar graphs too.
We will talk about customizing plots later in this lesson, but the last line of code is one way to remove the legend title (by making the legend.title 'blank').
# Reorder our Taxa with fct_reorder2
# 1. Data
ggplot(filtered_latrines) +
# 2. Aesthetics
aes(x = Country, y = OTUs, fill = fct_reorder2(Taxa, Country, OTUs)) + # Reorder our Taxa by country and OTU
theme(legend.title = element_blank()) + # Update the legend by removing the title
# 4. Geoms
geom_bar(stat = "identity")
Now our highest abundance Taxa is the first value in our legend, and this matches the order of the Taxa in our bar graph. The legend order now has meaning.
You can alternate between 'stacked' or 'dodged' (as we did with the histogram) for whether your bars are on top or next to each other when splitting by a factor or categorical variable.
# Update our graph to dodge the bars instead of stacking
# 1. Data
ggplot(filtered_latrines) +
# 2. Aesthetics
aes(x = Country, y = OTUs, fill = fct_reorder2(Taxa, Country, OTUs)) + # Reorder our Taxa by country and OTU
theme(legend.title = element_blank()) + # Update the legend by removing the title
# 4. Geoms
geom_bar(stat = "identity", position = "dodge") # Switch to dodging the bars instead of stacking
coord_flip()¶Our data looks quite squished when displaying the bars vertically. You can have your bars horizontal instead of vertical by using the coord_flip() layer.
# Adjust our plot window size according to the expected output
options(repr.plot.width=14, repr.plot.height=7)
# Let's flip this graph
# 1. Data
ggplot(filtered_latrines) +
# 2. Aesthetics
aes(x = Country, y = OTUs, fill = fct_reorder2(Taxa, Country, OTUs)) + # Reorder our Taxa by country and OTU
theme(legend.title = element_blank()) + # Update the legend by removing the title
# 4. Geoms
geom_bar(stat = "identity", position = "dodge") + # Switch to dodging the bars instead of stacking
# 7. Coordinates
coord_flip()
Boxplots are a great way to visualize summary statistics for your data. As a reminder, the thick line in the center of the box is the median. The upper and lower ends of the box are the first and third quartiles (or 25th and 75th percentiles) of your data. The whiskers extend to the largest value no further than 1.5*IQR (inter-quartile range - the distance between the first and third quartiles). Data beyond these whiskers are considered outliers and plotted as individual points. This is a quick way to see how comparable your samples or variables are.
We are going to use boxplots to see the distribution of OTUs per Taxa across all samples.
# Let's make a basic boxplot with our original latrines data
# 1. Data
ggplot(latrines) +
# 2. Aesthetics
aes(x = ..., y = ...) +
# 4. Geoms
...
theme() and angle¶So we can immediately see there are some issues with the plot. Text along the x-axis is overlapping and illegible. Let's fix the text on the x-axis by rotating it 90 degrees. To accomplish this we will use the theme() layer.
# Access the theme of the plot and update the text angle
# 1. Data
ggplot(latrines) +
# 2. Aesthetics
aes(x = Taxa, y = OTUs) +
theme(...) +
# 4. Geoms
geom_boxplot()
theme() and hjust¶We've updated the angle of our text but they're positioned on somewhat of a "centred" alingment. We can justify the labels such that they align with the x-axis.
You may ask why this is a horizontal (hjust) justification, when it seems like moving the labels upwards towards the x-axis should be a vertical (vjust) justification. If you look in the help menu at element_text() you will see that the justification is carried out before the rotation. While we can specify the parameters of element_text() in any order, this does not change the order they are executed in the function.
# Update our plot to push our text to align with the x-axis
# 1. Data
ggplot(latrines) +
# 2. Aesthetics
aes(x = Taxa, y = OTUs) +
theme(axis.text.x = element_text(angle = 90, ...)) +
# 4. Geoms
geom_boxplot()
While it is clear that Clostridia is the most represented taxa, it is difficult to tell whether some other taxa have no representation, or if they simply have low representation. Transforming to a log scale on the y-axis may clarify this question for us.
# Transform our y-axis to the log scale
# 1. Data
ggplot(latrines) +
# 2. Aesthetics
aes(x = Taxa, y = OTUs) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
# 3. Scaling
...+
# 4. Geoms
geom_boxplot()
Do you know why were are getting the warning: Transformation introduced infinite values in continuous y-axis? What if we were working with filtered_latrines instead?
For now, let's just keep the Taxa that are common to both countries and have greater than 1 OTU counts. To do this we can revisit our favourite dplyr verbs:
filtered_latrines group_by() Country and Taxa to get Country-specific information about Taxa filter() Taxa that are shown to appear in more than one sample. ungroup() to select our final list or we will end up with a grouped object. duplicated() function to retain those Taxa that were found twice (ie. in both countries).The %in% function returns a logical vector of whether there is a match or not for its first argument in its second argument. It is useful for subsetting.
We can facet our plot by Country to verify we have only commonly appearing OTUs.
keep <- filtered_latrines %>%
group_by(...) %>%
summarize(n = n()) %>% #count the number of entries
filter(n > 1) %>% # get rid of any entries with only a single sample
ungroup() %>% # remove the grouping
select(Taxa) %>% # look only at the Taxa column
filter(duplicated(.)) # keep only the samples that are listed twice (ie in both countries)
# Make a new object that filters your taxa
sub_filtered_latrines <-
filtered_latrines %>%
filter(Taxa %in% keep$Taxa)
# Plot that graph!
# 1. Data
ggplot(...) +
# 2. Aesthetics
aes(x = Taxa, y = OTUs) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
# 3. Scaling
scale_y_log10() +
# 4. Geoms
geom_boxplot() +
# 6. Facets
facet_grid(~Country)
We will be using this graph as a base for customization later in the lesson.
Even though boxplots give us summary statistics on our data, it is useful to be able to see where our individual data points are. We've already used geom_rug() to help visualize our data distribution in density plots.
Similarly, for a boxplot we can add the data as another layer using geom_point() to place dots on top of our boxplot, or use geom_jitter() to spread our points out a bit. However, a beeswarm plot places data points that are overlapping next to each other, so we can get a better picture of the distribution of our data. We'll start off by looking at the geom_beeswarm() function from the ggbeeswarm package.
I have subset the data to make a boxplot with 3 Taxa and 1 Country so that we can see the differences in geom options a bit better.
# Reset image sizes for our comfort
options(repr.plot.width=7, repr.plot.height=7)
# Select only taxa from Tanzania that is either Clostridia, Unknown, or Bacilli
subdat <-
latrines %>%
filter((Taxa=="Clostridia" | Taxa == "Unknown" | Taxa == "Bacilli") &
Country == "Tanzania" &
OTUs >0 ) # filter out 0 values or you'll get an error down the road!
# Save the plot to a variable
boxplot <-
# 1. Data
ggplot(...) +
# 2. Aesthetics
aes(x = Taxa, y = OTUs) +
theme(axis.text.x = element_text(angle=90, hjust=1)) +
# 3. Scaling
scale_y_log10() +
# 4. Geoms
geom_boxplot()
# Display the plot
boxplot
ggplot objects as variables that you can continue to update¶As you can see above, an option with ggplot2 is to save your plot into a ggplot object. This works well if you know you are only changing one or two elements of your plot, and you do not want to keep retyping code. What we are going to vary here is how the data points are displayed.
Now, we can simply overlay the points with geom_beeswarm(). Notice that this geom comes from the ggbeeswarm package and is not a part of ggplot2 itself. However, it was built to work with ggplot2 objects!
# Load the ggbeeswarm package
library(ggbeeswarm)
# Add a geom to our saved plot
boxplot + ...
cex is a common parameter used to adjust plotting properties of text and symbols¶Depending on the function or geom you may often find that the cex parameter can be adjusted to alter some aspect of how a geom or other graphical layer is displayed. In the case of geom_beeswarm() we can increase the spacing between data points to make its distribution a bit clearer.
# Update the cex parameter
boxplot + ...
geom_quasirandom()¶If you think you will have many points to display or if you want to avoid adjusting parameters with each new plot, consider using a geom_quasirandom() to give the empirical distribution of the stripplot to avoid overplotting. It is a geom included with the ggbeeswarm package and can simplify the look and creation of your plots
# replace geom_beeswarm() with a geom_quasirandom()
boxplot + ...
Other spacing and distribution options are available at https://github.com/eclarke/ggbeeswarm.
Up to this point, we've discussed some of the ways to alter our plots in ways to best visualize our data by dabbling with colour, fill, shape, and position. This has been accomplished mostly through the aes() attribute. Depending on the nature of the layers or elements you add, we can alter their characteristics indidually to customize.
Plot elements relating to your data include things like axis labels, titles, colour or shapes that represent subsets of your data, scaling that is data-dependent, legends, and other data-driven parameters.
For customizing your data it is possible to change:
colour()fill()shape()size()alpha()Titles and axis labels can be added using:
ggtitle()xlab()ylab()We have seen in the above examples that colour can be applied to discrete or continuous variables. We can also use colour (shape, etc.) to represent outliers. In this dataset, outliers beyond the whiskers (above or below 1.5*IQR) can be coloured red.
Let's review our base boxplot of shared taxa between countries from section 2.3.3. We'll quickly update it to include some colour:
# Reset image sizes for something a little wider
options(repr.plot.width=14, repr.plot.height=7)
# 1. Data
ggplot(sub_filtered_latrines) +
# 2. Aesthetics
aes(x = Taxa, y = OTUs, fill = ...) + # fill our boxplots with color based on Taxa
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
# 3. Scaling
scale_y_log10() +
# 4. Geoms
geom_boxplot(...) + # Specify the colour of outliers
# 6. Facets
facet_grid(~Country)
Let's start off by sprucing up our plot with
ggtitle() to add a titleylab() to rename for those less acquainted with the OTU acronymxlab() to remove the "Taxa" labelguides() to remove the legendI also want to add a title, modify the y-axis label to say that the data is log, and remove the x-axis label. Note that I remove the x-axis label by using 'NULL'.
# Update the various titles on our plot
# 1. Data
ggplot(sub_filtered_latrines) +
# 2. Aesthetics
aes(x = Taxa, y = OTUs, fill = Taxa) + # fill our boxplots with color based on Taxa
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
# Update our titles and remove the legend
...("Abundance of Taxa by Country") +
...(NULL) +
...("Operational Taxonomic Units") +
...(fill="none") +
# 3. Scaling
scale_y_log10() +
# 4. Geoms
geom_boxplot(outlier.colour = "red") + # Specify the colour of outliers
# 6. Facets
facet_grid(~Country)
labs() command to control multiple labels¶Use individual commands to alter the x-, y-axis titles and the title of your plot can give you control over aspect of each individual element like font, size, and colour. If you want them to all have a uniform aesthetic, you can simply use the labs() command. This can include legend titles too!
# Update the various titles on our plot
# 1. Data
ggplot(sub_filtered_latrines) +
# 2. Aesthetics
aes(x = Taxa, y = OTUs, fill = Taxa) + # fill our boxplots with color based on Taxa
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
# Update our titles and remove the legend
# Use the labs() command to set all of your labels
...(title = "Abundance of Taxa by Country",
x = NULL,
y = "Operational Taxonomic Units") +
guides(fill="none") +
# 3. Scaling
scale_y_log10() +
# 4. Geoms
geom_boxplot(outlier.colour = "red") + # Specify the colour of outliers
# 6. Facets
facet_grid(~Country)
I also want to capitalize the Country labels. This can be done in a couple of ways.
One way would be to change the values in the dataset using string manipulation. A second way, would be using the labeller() function. I can make a vector of the capitalized names to replace 'Tanzania' and 'Vietnam'. The data is split by Country in the facet_grid() and this is where we pass our labels to labeller(), which will output the names on the strip label.
I am now going to save this plot in a ggplot object, since we are going to use this as our base plot for the next section. I am also going to rotate the x-axis text again. We will talk about theme elements soon, but in the meantime it won't drive me bonkers.
# Make a named character vector for our labels
country_labels <- c(Tanzania = "TANZANIA", Vietnam = "VIETNAM")
# Assign our plot to an object for alteration later on
my_plot <-
# 1. Data
ggplot(sub_filtered_latrines) +
# 2. Aesthetics
aes(x = Taxa, y = OTUs, fill = Taxa) + # fill our boxplots with color based on Taxa
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
# Update our titles and remove the legend
# Use the labs() command to set all of your labels
labs(title = "Abundance of Taxa by Country",
x = NULL,
y = "Operational Taxonomic Units") +
guides(fill="none") +
# 3. Scaling
scale_y_log10() +
# 4. Geoms
geom_boxplot(outlier.colour = "red") + # Specify the colour of outliers
# 6. Facets
facet_grid(~Country, labeller = ...) # Set the Country labels to our character vectors
# display our plot
my_plot
A common thing to want to do is to change colours from ggplot2's default rainbow palette. There are many reasons to change a colour palette including
Let's create our own colour palette for Taxa.
There are 3 main types of colour palettes.
Sequential - implies an order to your data - i.e. light to dark implies low values to high values, e.g. heatmaps.
# Load the RColorBrewer library
library(RColorBrewer)
# display the sequential colour palettes
display.brewer.all(type = "seq")
Diverging - low and high values are extremes, and the middle values are important - still goes from light to dark, but 3 colours mainly used.
# Display the diverging colour palettes
display.brewer.all(type = "div")
display.brewer.all(type = "qual")
Let's test one of the RColorBrewer palettes out on our data. We'll add it as a layer to my_plot using scale_fill_brewer() to override the fill mappings defined in the aes() layer of the plot.
my_plot + ...(palette = "Spectral")
ggplot¶Notice the warning we received: "n too large..."? Note that we have 24 different taxa along our x-axis but the Spectral palette only has 11 colours. Unlike when we see vector recycling in previous lectures, this does not occur when supplying a colour palette to our plot. In generating our plot, we only colour the first 11 colours.
RColorBrewer colour palettes can be created with brewer.pal()¶Many colour palettes now exist. I'll showcase a couple that work nicely with ggplot2. These packages also have colorblind friendly options. RColorBrewer has options for these 3 types of palettes, which you can see with display.brewer.all(). With a smaller dataset, we could make a call in ggplot directly to scale_fill_brewer(), which just requires choosing one of RColorBrewer's palettes, such as "Spectral". However, we have 24 Taxa and these palettes have 8-12 colours, so we have to get creative.
Using the brewer.pal() function, we can pull different colours from palettes of our choosing. In our case, I have simply taken the 2 qualitative palettes that each have a length of 12, put them into one palette, and made sure my values were unique. This can then be passed to ggplot via scale_fill_manual().
display.brewer.all()
Looks like we can use the Paired and Set3 palettes since they both have 12 colours that seem distinct enough. There may be some close colours though.
# Generate 2 palettes from the longest ones
palette1 <- brewer.pal(12, "...")
palette2 <- brewer.pal(12, "...")
# combine into a single palette
custom <- unique(c(palette1, palette2))
# Do we still have enough colours?
custom
length(custom)
Looks like we have enough colours to satisfy our needs. Notice that these are coded using a hexadecimal system? Let's provide this vector as input.
# Update our plot by adding colour
my_plot + ...(values = custom)
You can always choose a vector of your own colors using this 'R color cheatsheet' .
If you just want a repeating patterns of colours, you can use the rep() command to help you out too!
# Reminder of how the rep() command works
rep(c(1,2,3,4), ...)
# Fill the boxplot using a rep() command
my_plot + scale_fill_manual(values=rep(c("purple", "cornflowerblue", "grey", "yellow", "orange", "..."), 4))
viridis package¶Sometimes you may wish to work with a colour palette that best represents a continuous series of diverging values. In this case you may also want to ensure your colour palette avoids issues for readers that are printing in greyscale or those that may be colour blind. The viridis package contains some colour blind accessible palettes that can also help to really differentiate between the extremes of your spectrum.
The viridis package also has some nice color options (https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html). While these might all be diverging palettes (qualitative is best for our Taxa), we will showcase a couple here.
# Load the viridis package
library(viridis)
# Example 1 with viridis
my_plot + ...(discrete = TRUE)
# Example 2 with viridis (plasma)
my_plot + scale_fill_viridis(discrete = TRUE, option = "...")
RSkittleBrewer is another option for funky colour palettes.
ggsci has a variety of color palettes inspired by different scientific journals as well as television shows (https://cran.r-project.org/web/packages/ggsci/vignettes/ggsci.html).
As mentioned earlier, it is possible to customize every single aspect of a ggplot. Most of this occurs with a call to theme(), which you can think of as modifying everything BUT your data. For example, my axis labels can be modified, but they (hopefully) have something to do with my data. However, changing the size of the text or the font of the labels is unrelated to my data, and the same structure (text font & size) could be carried over to other plots if I saved my own theme.
Things that you can change with theme() include the axis, legend, panels, gridlines, or background.
Each element of a theme inherits from one of:
element_text (text elements like font, colour, size, face (bold, italics), alignment), element_line (grid lines, axis lines), element_rect (panels and backgrounds - colour, size, fill), element_blank (assigns nothing, usually when you are trying to get rid of something), element_grob (making a grid grob).ggplot2 comes with some themes - I suggest starting with the one that is close to what you want, and start modifying from there.
Check out these themes:
theme_minimal()theme_classic()theme_bw()theme_void()theme_dark()theme_gray()theme_light()You can look at the default for each theme simply by typing it into the console.
theme_bw
And this is what theme_bw() practically looks like:
# Alter the theme of my_plot
my_plot + ...
Notice how that last call to theme overrides my previous call to theme if there is a conflict. In this case the angle of the x-axis text is returned from a vertical to a horizontal orientation.
Here is an example of theme_dark(). I am going to override the default x-axis text angle of this theme by modifying it AFTER I call theme_dark().
# Alter my_plot and fix the x-axis
my_plot +
theme_dark() +
...(axis.text.x = element_text(angle = 90, hjust = 1))
# When building plots from scratch, be sure to place the theme_* above other theme changes
# 1. Data
ggplot(sub_filtered_latrines) +
# 2. Aesthetics
aes(x = Taxa, y = OTUs, fill = Taxa) +
... + # Add the dark theme first!
theme(axis.text.x = element_text(angle = 90, hjust = 1)) + # then alter your other thematic elements
labs(title = "Abundance of Taxa by Country",
x = NULL,
y = "Operational Taxonomic Units") +
guides(fill="none") +
# 3. Scaling
scale_y_log10() +
# 4. Geoms
geom_boxplot(outlier.colour = "red") + # Specify the colour of outliers
# 6. Facets
facet_grid(~Country, labeller = labeller(Country = country_labels)) # Set the Country labels to our character vectors
ggthemes package¶ggthemes is a package of themes. Some of these themes are based off of graphs seen in print or on websites (the economist, wall street journal, fivethirtyeight) or to match standard tools (excel, google docs).
Here are 2 possible themes.
# Load the ggthemes package
library(ggthemes)
# Add the economist theme to our plot
my_plot +
... + # Do you enjoy blue background panels?
theme(axis.text.x = element_text(angle=90, hjust=1)) # fix the x-axis
# An example of replicating the style from "Stata" software
my_plot +
... + # Blue Plot background with white paneling
theme(axis.text.x = element_text(angle=90, hjust=1))
You can also make your own custom theme as demonstrated here: http://joeystanley.com/blog/custom-themes-in-ggplot2
I am going to show you how to customize a plot, starting from theme_minimal() because I don't like the grey backgrounds or harsh axis lines.
# Start by using the minimal theme
my_plot +
theme_minimal()
theme() layer¶Depending on the layout of your plot you can institute changes to the theme as you build your plot or afterwards. Just remember, each call to theme() will override any previous calls that conflict, so the order of changes is important. Many arguments to theme() represent major element categories, but there can be arguments that specifically represent sub-categories or sub-elements.
Things I don't like about this plot and their solutions:
| Problem | Solution | Layer / Command |
|---|---|---|
| x-axis labels overlap and are small | rotate labels | axis.text.x |
| country labels are smaller than axis labels | change size and face | strip.text.x |
| title is uncentered | adjust position horizontally | plot.title |
| need a border to separate countries | create a border around each panel | panel.border |
| add y axis ticks | update y axis ticks | axis.ticks.y |
As mentioned the last call to theme() will override previous calls that conflict. Therefore, if we want to start with theme_minimal() as our base, it has to be in our code BEFORE the other modifications.
# Add our own theme elements
my_plot +
theme_minimal() + # start with theme minimal
theme(axis.text.x = ...(angle = 90, hjust = 1, vjust=0.5, size=14), # Adjust x-axis text and position
panel.border = ...(fill=NA), # Add a panel border to each facet
strip.text.x = element_text(face = "bold", size = 16), # alter the facet title text
plot.title = element_text(hjust=0.5, size = 18), # Centre that plot title
axis.ticks.y = ...()) # Add some little tick marks on the y-axis
# Note that you could break this into multiple theme() calls as well!
Alter my_plot by:
# Fill the blanks
my_plot +
theme_minimal() + # start with theme minimal
# Our previous theme update
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust=0.5, size=14),
panel.border = element_rect(fill = NA),
strip.text.x = element_text(face = "bold", size = 16),
plot.title = element_text(hjust=0.5, size = 18),
axis.ticks.y = element_line()) +
# Our current theme update
theme(plot.background = ...(fill = "..."), # Set the background to a rectangle with new colour
panel.grid.minor = element_line(), # Add minor grid lines
panel.grid.major = element_line(colour="black")) # Add black major grid lines
There are a lot of way to customize your plots! Keep exploring and playing with parameters!
You may be wondering, can I save this awesome theme to apply to all my amazing plots? Yes, there are a number of ways to import your themes to other scripts if you learn to save your data objects to file in Lecture 07! For now, you can assign your themes to a variable and apply them to plots like any other layer.
# Save you theme to a variable
... <-
theme_minimal() + # start with theme minimal
# Our previous theme update
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust=0.5, size=14),
panel.border = element_rect(fill = NA),
strip.text.x = element_text(face = "bold", size = 16),
plot.title = element_text(hjust=0.5, size = 18),
axis.ticks.y = element_line()) +
# Our current theme update
theme(plot.background = element_rect(fill = "cornflowerblue"), # Set the background to a rectangle with new colour
panel.grid.minor = element_line(), # Add minor grid lines
panel.grid.major = element_line(colour="black")) # Add black major grid lines
# Apply your theme as a layer
my_plot + theme_personal
Up until now, we have taken for granted that our plots have been displayed using a Graphic Device. For our Jupyter Notebooks we can see the graphs right away and update our code. You can even save them manually from the output display but sometimes you may be producing multiple visualizations based on large data sets. In this case it is preferable to save them directly to file.
Plots must be created on a graphics device
The default graphics device is almost always the screen device, which is most useful for exploratory analysis.
File devices are useful for creating plots that can be included in other documents or sent to other people.
For file devices, there are vector (pdf, svg, postscript) and bitmap (png, jpeg, tiff) formats.
Vector formats are good for line drawings and plots with solid colors using a modest number of points.
Bitmap formats are good for plots with a large number of points, natural scenes or web-based plots.
(https://rdpeng.github.io/Biostat776/notes/pdf/grdevices.pdf)
ggplot2 has its own function for saving its graphics: ggsave(). This allows us to skip the step of explicitly calling separate graphics devices and shutting them down afterwards (if you have saved plots in base R or lattice, this will sound familiar to you).
You can send the plot object to the screen device to preview your image, and then save that image by specifying the file device. If you do not specify the device type, ggsave() will guess it from your filename extension (pdf, jpeg, tiff, bmp, svg or png). Note that this will save whatever graphic was last on your screen device.
With ggsave() you can minimally input the filename you would like to have, and the path to your file.
# Save the last plot displayed by ggplot
ggsave("...", path = "data")
However, in some cases you want to tailor your output. You can specify the width, height and units of your image, or you can apply a scaling factor (the 'eyeballing' approach). You can also specify the plot object you want to save instead of whatever was on your graphics device last using the 'plot' parameter. Note that this time I have combined the path with the filename, and called the file device type separately.
# Save our altered plot to an object
saved_plot <- my_plot + theme_personal
# Specifically make saved_plot a pdf!
ggsave("data/crazy_blue_graph2.pdf", # The path for our output
plot = saved_plot, # The object we want to save
device = "pdf", # explicitly name the type of file we want to make, despite the name
scale = 2, width = 250, height = 110, units = "...") # Set some parameters for the final size
No image is sent to the screen device when a file is saved in this manner.
ggpub package¶There are many fantastic R packages to analyze and visualize your data. As a group, we are likely working in a variety of specialized areas. The plots we have made so far today should be useful for data exploration for many different kinds of data. In the next section we are going to preview some more complex visualization types, but since these take more time to go through and not everyone may be interested in interactive graphics, network diagrams, time-series analysis, or geospatial data, we will not be plotting all of these together. We will, however learn how to arrange multiple plots per page, and also how to make an upset plot.
ggarrange()¶There are a variety of methods to mix multiple graphs on the same page, however ggplot2 does not work well with all of them. I am going to work with a package base called ggpubr which allows us to align the axes of our plots. This package relies on gridExtra(which allows us to arrange plots) and works well with ggplot2.
For a demonstration, we are going to take 3 plots that we made earlier (a boxplot, a histogram, and a dot plot), save them as objects, and then arrange and align them in the same figure. (http://www.sthda.com/english/rpkgs/ggpubr/)
ggarrange() is a function from ggpubr that takes your plots, their labels, and how you would like your plots arranged in rows and columns. It takes the form of:
ggarrange(
...,
plotlist = NULL,
ncol = NULL,
nrow = NULL,
labels = NULL,
label.x = 0,
label.y = 1,
hjust = -0.5,
vjust = 1.5,
font.label = list(size = 14, color = "black", face = "bold", family = NULL),
align = c("none", "h", "v", "hv"),
widths = 1,
heights = 1,
legend = NULL,
common.legend = FALSE,
legend.grob = NULL
)
Of the parameters some relevant ones for us are:
... - the list of plots to be arranged as a grid or alternatively useplotlist - An optional list of plots to displaylabels - An optional list of labels for each plotncol - number of columns in the plot grid (optional)nrow - number of rows in the plot grid (optional)Some examples of simple grid arrangements are :
To start, we want our boxplot and dot plot side by side. If you picture each plot as a square in a grid, we need two columns (one for each plot, ncol = 2) and one row (since they are side by side, nrow = 1).
# Load the ggpubr package
library(ggpubr)
# Create a histogram
hist <-
# 1. Data
ggplot(filtered_latrines) +
# 2. Aesthetics
aes(x=OTUs, fill=Country) +
# 3. Scaling
xlim(0,1000) +
ylim(0,150) +
# 4. Geoms
geom_histogram(binwidth = 50, position = "dodge", alpha=0.3) +
geom_rug()
# Create a dot plot
dot <-
# 1. Data
ggplot(metadata) +
# 2. Aesthetics
aes(x=TS, y=CODt) +
# 3. Scaling
scale_y_log10() +
# 4. Geoms
geom_point() +
# 5. Stats
stat_smooth(method = lm) +
# 6. Facets
facet_grid(~Country)
# Set up a boxplot for our example
box <-
# 1. Data
ggplot(subdat) +
# 2. Aesthetics
aes(x = Taxa, y = OTUs) +
# 3. Scaling
scale_y_log10() +
# 4. Geoms
geom_boxplot() +
geom_quasirandom(varwidth=TRUE)
# Arrange the two plots in a single page
ggarrange(dot, hist, # Plots (and their order)
... = c("A", "B"),
ncol = ..., nrow = ...)
theme() layer¶While the grid areas are of the same size, the backgrounds are not. Let's adjust the legend of our histogram so that it is in the top right corner of the plot, and remove the white background.
Things to note:
legend.position parameter uses a tuple where each value is {0,1}. (0,0) is the lower-left and (1,1) is the upper-right of a graph. You can also specify "left", "right", "top", and "bottom" for positions outside your graph.legend.background parameter and others can be set to elements like a element_rect() but they can also be removed using the placeholder element_blank().# Alter our histogram
hist <-
# 1. Data
ggplot(filtered_latrines) +
# 2. Aesthetics
aes(x=OTUs, fill=Country) +
theme(legend.justification=..., # Set the anchor position of the legend box.
legend.position=..., # Reposition the legend based on its anchor to the top right
legend.background = ...) + # Remove the background of the legend
# 3. Scaling
xlim(0,1000) +
ylim(0,150) +
# 4. Geoms
geom_histogram(binwidth = 50, position = "dodge", alpha=0.3) +
geom_rug()
# Arrange the plots again
ggarrange(dot, hist,
labels = c("A", "B"),
ncol = 2, nrow = 1)
Next we will add in the boxplot by nesting a ggarrange() call within another.
Imagine a square with 4 boxes.
To do this, we are arranging 2 columns (one with the boxplot and one with the [histogram + bubble plot], ncol = 2) and we are arranging 2 rows (one with the histogram and one with the scatterplot, nrow = 2).
# Build our new grid setup
# 1. First call initiates the 2-column grid
ggarrange(box, # The left-hand column is a boxplot
...(hist, dot, # The right-hand column is a nested call with two plots
labels = c("B", "C"),
nrow = ...), # arrange the right-hand column as two rows
ncol = 2, labels = "A") # Arrange the outer grid as two columns
align and font()¶If y-axis lines or x-axis lines are not aligned, this can be fixed with a call to align = "v" or align="h". Note that this will align the edges of the plot object, not the panels that represent data alone.
To make sure all axis titles are the same size (B and C look a little off to me), we can use font() to specify which text we want changed and the size we want to change it to. I am also going to make the legend title size the same.
Let's look at the font() function, which is actually part of the ggpubr package. You'll see that we can treat it much like adding a layer to our plots as we use the + operator. It acts like a wrapper to directly alter the ggplot object through underlying calls to the theme layer. Although limited, there are a number of elements that it can affect, including fonts for:
Let's try out the font() function now and save the result into a new variable multiplot. You'll notice it's still not quite perfect but better in many places.
# Alter the fonts of our layout
# set all axis and legend fonts to size 9
multiplot <-
ggarrange(box + ..., # Alter boxplot fonts
ggarrange(hist +
font("axis.title", size=9) + # Alter histogram fonts
font("legend.title", size=9),
dot +
font("axis.title", size=9), # Alter scatterplot fonts
labels = c("B", "C"),
nrow = 2, align = "v"), # Try to align the vertical axis of the histogram and scatterplot
ncol = 2, labels = "A")
multiplot
ggsave()¶The ggarrange objects, while structurally different from ggplot objects, inherit much of their information and behaviours from the ggplot class. Therefor, you can use other ggplot functions like ggsave() to write your plots to file. The calls follow the same format as previous examples we've used so let's give it a try.
# Confirm the object type of our multiplot
class(multiplot)
# Save it to a JPEG file for using in our presentations
ggsave(plot = ..., file="data/multiplot.jpg", width = 200, height = 110, units = "mm")
Using the code available here for a density plot we made earlier in the lesson, make a combined figure. We are going to put our density plot across the top row, the histogram in the bottom left corner, and the box plot in the bottom right corner. Make sure the legend and axis titles are the same size. Change the legend text for the line graph to be smaller than the legend title.
# Remake our density plot from earlier
dens <-
# 1. Data
ggplot(filtered_latrines) +
# 2. Aesthetics
aes(x=OTUs, fill=Country) +
theme(legend.justification=c(1,1),
legend.position=c(1,1),
legend.background = element_blank()) +
# 3. Scaling
xlim(0, 1000) +
# 4. Geoms
geom_density(alpha = 0.3) +
geom_rug()
# Let's arrange a new set of panels
ggarrange(dens +
font("axis.title", size=9) +
font("legend.title", size=9) +
font("legend.text", size = ...), # So we've set axis and legend titles to 9, legend text to 6
ggarrange(hist +
font("axis.title", size=9) +
font("legend.text", size = 6) +
font("legend.title", size=9), # subpanel histogram with the same properties
box +
font("axis.title", size=9), # subpanel boxplot with size 9 axis title
labels = c("B", "C"), # label the subpanels as B and C
ncol = 2, align = "h"), # set 2 columns in this subpanel
nrow = 2, labels = "A") # set two rows for the main panel
UpSetR: https://github.com/hms-dbmi/UpSetR
Upset plots are an alternative to Venn diagrams that show the intersection of sets, as well as the size of the sets. Additionally, Venn diagrams can be difficult to interpret for greater than 2 or 3 sets. This is a real life figure from BMC Bioinformatics. Sure it looks pretty, but what does the number 24 represent in this picture in terms of A, B, C, D, and E?
</br>
UpsetR to visualize overlapping datasets¶Let's see how UpSet plots work practically.
Question: For 8 Taxa, how many sites are those Taxa present at, and what is the overlap between sites with other Taxa?
I have picked 8 different Taxa, which would make a crazy venn diagram with a lot of zeros in it since a couple of these Taxa are rare.
subtax <- read_csv("data/SPE_pitlatrine.csv")
eight <- c("Chloroflexi", "Acidobacteria_Gp16", "Acidobacteria_Gp5", "Cyanobacteria",
"Lentisphaeria", "Deinococci", "Alphaproteobacteria", "...")
#subset for the 8 taxa
subtax <- subtax %>% filter(Taxa ... eight)
subtax
The data that we have is an OTU table. It is 8 rows (as expected) by 82 columns (81 sample entries)
For this purpose we simply want to know whether a given Taxa was present or absent in the sample, and then we can form intersections based on this information. So, for each OTU value that is not 0, we replace it with a 1 instead.
# This is a tidyr way of transposing our data frame so Taxa are in columns and sites are in rows.
# This format is for the UpSet plot only (we can do the value replacement in either format)
subtax <-
subtax %>%
pivot_longer(cols = ..., names_to = "sample", values_to = "otus") %>%
pivot_wider(names_from = ..., values_from = ...)
head(subtax)
In our next step, as we try to replace all non-zero values with a 1, we are using our good friend apply() in addition with a conditional statement ifelse(). The ifelse() function allows us to test with a conditional and assign a value based on whether or not the answer is TRUE or FALSE.
# If any OTUs are present, substitute the value with a 1 (exclude the first row which is character).
#Take a look at the apply function syntax:
subtax[,-1] <- apply(subtax[,-1],
MARGIN = 2, # Margin to apply the function over. 1 = rows, 2 = columns
function(x) ...) # Conditional statement
head(subtax)
str(subtax)
upset() function to generate an UpSet plot¶Now that we've properly formatted our table subtax, it has 81 rows (samples) by 9 columns (8 taxa that we are investigating)
To use the upset() plotting function, we enter our data set, the number of sets we are inputting, if we want to order the results (in this case by frequency), and how many intersections we want to show. Here, I will show 15 intersections - we know the remaining intersections would be zero since this is ordered by frequency.
# Load the UpSetR package
library(UpSetR)
upset(...(subtax), # Our dataset
nsets = ..., # The number of set we are providing (8 taxa)
empty.intersections = TRUE, # Should it include empty intersections in the output?
order.by = "freq", # Order the output by frequency
nintersects = 15, # Show a total of 15 intersects
main.bar.color = "black", # Set the bargraph colour
sets.bar.color = "#56B4E9") # Set the set bargraph colour (lower left)
# This UpSet plot show taxa co-occurrence -taxa found together
Our plot can be broken into 3 sections.
nintersects.From our result, our greatest intersection size is 28 instances. This means that 28 of our 81 samples have Clostridia, Alphaproteobacteria, and Deinococci present in the same sample WITHOUT Chloroflexi, both Acidobacteria, Cyanobacteria and Lentisphaeria.
We can see from the Set Size that Clostridia and Alphaproteobacteria are present in almost all samples, and Deinococci is present in greater than 3/4 of the samples. You might expect this overlap to be higher, but remember that this is without all the other Taxa.
Moving along our interesection sizes and dot matrix, we can see that other intersections include these bacteria. For example, the next bar to the right with an intersection size of 16, also includes Lentisphaeria. Some quick mental math shows that 63 samples have these top 3 Taxa present.
While we have just scratched the surface of ggplot, as mentioned earlier in lecture there are many additional visualization packages that can work with more specific types of data. In some cases, these packages add functionality to the ggplot package itself!
Plotly: https://plot.ly/r/
ggvis: http://ggvis.rstudio.com/interactivity.html
Heatmaps: https://github.com/talgalili/heatmaply
Interactive time-series data: https://rstudio.github.io/dygraphs/
visNetwork (based on igraph): https://datastorm-open.github.io/visNetwork/edges.html
Migest: https://gjabel.wordpress.com/category/r/migest/
Static Maps:
Interactive Maps:
ggtree:
treeman:
metacoder:
phyloseq:
ggbio:
GenVisR:
GenomeGraphs:
Challenge
Install and load the gapminder package. Save the gapminder data into an object using latrines <- gapminder. Plot the following:
That's the end for our fourth class on R! We took a break from data wrangling this week to focus on the basics of data visualization including:
ggplot.Soon after the end of each lecture, a homework assignment will be available for you in DataCamp. Your assignment is to complete all chapters from the Introduction to Data Visualization with ggplot2 course which has a total of 4 chapters and 4,300 points. This is a pass-fail assignment, and in order to pass you need to achieve a least 3,225 points (75%) of the total possible points. Note that when you take hints from the DataCamp chapter, it will reduce your total earned points for that chapter.
In order to properly assess your progress on DataCamp, at the end of each chapter, please take a screenshot of the entire course summary. You'll see this under the "Course Outline" menubar seen at the top of the page for each course and you'll want to expand each section. It should look something like this:
You may need to take several screenshots if you cannot print it all in a single try. Submit the file(s) or a combined PDF for the homework to the assignment section of Quercus. By submitting your scores for each section, and chapter, we can keep track of your progress, identify knowledge gaps, and produce a standardized way for you to check on your assignment "grades" throughout the course.
You will have until 13:59 hours on Thursday, October 14th to submit your assignment (right before the next lecture).
Revision 1.0.0: materials prepared in R Markdown by Oscar Montoya, M.Sc. Bioinformatician, Education and Outreach, CAGEF.
Revision 1.1.0: edited and preprared in Jupyter Notebook by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
This class is supported by DataCamp, the most intuitive learning platform for data science and analytics. Learn any time, anywhere and become an expert in R, Python, SQL, and more. DataCamp’s learn-by-doing methodology combines short expert videos and hands-on-the-keyboard exercises to help learners retain knowledge. DataCamp offers 350+ courses by expert instructors on topics such as importing data, data visualization, and machine learning. They’re constantly expanding their curriculum to keep up with the latest technology trends and to provide the best learning experience for all skill levels. Join over 6 million learners around the world and close your skills gap.
Your DataCamp academic subscription grants you free access to the DataCamp's catalog for 6 months from the beginning of this course. You are free to look for additional tutorials and courses to help grow your skills for your data science journey. Learn more (literally!) at DataCamp.com.
Wickham, Hadley. (2010). A Layered Grammar of Graphics. Journal of Computational and Statistical Graphics.
Wilkinson, L. (2005), The Grammar of Graphics (2nd ed.). Statistics and Computing, New York: Springer. [14, 18]
Tufte, Edward R. The Visual Display of Quantitative Information.
http://www.cookbook-r.com/Graphs/
https://github.com/jennybc/ggplot2-tutorial
http://stcorp.nl/R_course/tutorial_ggplot2.html
http://ggplot2.tidyverse.org/reference/theme.html
http://joeystanley.com/blog/custom-themes-in-ggplot2
https://github.com/jrnold/ggthemes
https://www.nceas.ucsb.edu/~frazier/RSpatialGuides/colorPaletteCheatsheet.pdf
https://cran.r-project.org/web/packages/ggrepel/vignettes/ggrepel.html
http://www.cookbook-r.com/Graphs/Legends_(ggplot2)/
https://github.com/eclarke/ggbeeswarm
https://cran.r-project.org/web/packages/ggsci/vignettes/ggsci.html
http://www.sthda.com/english/rpkgs/ggpubr/
https://rpubs.com/drsong/9575
http://elpub.bib.uni-wuppertal.de/edocs/dokumente/fbb/wirtschaftswissenschaft/sdp/sdp15/sdp15006.pdf
http://www.sthda.com/english/articles/24-ggpubr-publication-ready-plots/81-ggplot2-easy-way-to-mix-multiple-graphs-on-the-same-page/
https://rdpeng.github.io/Biostat776/notes/pdf/grdevices.pdf
Here I've compiled some additional examples of plots using ggplot to help get you started on working with this beloved package.
In ggplot2, circular plots are related to bar graphs - they just have different coordinate systems. The default coordinate system is cartesian coordinates, and we need to switch to polar coordinates to make a circle.
This circular barchart is known by a few names including the Nightingale plot, rose chart, and Coxcomb plot. Looking at the plot, pH levels increase as you move up the outer rings, and depth increases as you move clockwise around the circle. Colour is still represented by country.
# Reload our metadata file
metadata <- read_csv("data/metadata_pitlatrine.csv",
col_types = 'cccnnnn') # Here we are explicitly specifying our column types
# Reload our latrine data
# Load up the taxa_pitlatrine.csv dataset
latrines <- read_csv("data/taxa_pitlatrine.csv", col_types = 'cccii')
# 1. Data
ggplot(metadata) +
# 2. Aesthetics
aes(x=Depth, y=pH, fill = Country) +
# 4. Geoms
geom_bar(stat="identity", position = "dodge") +
# 7. Coordinates
coord_polar()
Since we're on the topic of polar coordinates, to make your classic pie chart, use theta to specify what variable is going to be used to make up the angles (width of pie slices). To wrap to a full circle instead of having sections (as in the above Coxcomb plot), the width is set to one.
# 1. Data
ggplot(latrines) +
# 2. Aesthetics
aes(x="", y=Country, fill = Country) +
# Geoms
geom_bar(stat = "identity", width = 1) +
# Coordinates
coord_polar(theta = "y")
We briefly discussed line graphs in Lecture 03. Let's revisit how to make this visualization. To draw a line graph, we select geom_line().
# 1. Data
ggplot(metadata) +
# 2. Aesthetics
aes(x=Temp, y=pH) +
# 4. Geoms
geom_line()
We can colour the lines for the different countries. Looking at our previous graph, what must have happened when these values were over the same range?
# Generate a line plot with lines coloured by country
# 1. Data
ggplot(metadata) +
# 2. Aesthetics
aes(x=Temp, y=pH, colour = Country) +
# 4. Geoms
geom_line()
To plot error bars, we need to give the geom, geom_errorbar(), and the interval over which we want to draw the bar. This interval is usually one standard deviation, one standard error or a confidence interval.
Luckily, we have learned how to calculate the mean and standard deviation with dplyr. Once we have that, all we need to do is add or subtract the standard deviation from our data points to get the bounds of the error bar.
geom_errorbar() takes these bounds passed to the parameters ymax and ymin. In this case, since there were no sample replicates the standard deviation is taken from the mean of all of the soil readings in a given country at a given depth.
# Pregenerate the error bar data by adding a mean and standard deviation variable
# Note that these values are derived from the overall mean of the grouping
errdat <-
metadata %>%
group_by(Country, Depth) %>%
mutate(mean_pH = mean(pH),
sd_pH = sd(pH))
# 1. Data
ggplot(errdat) +
# 2. Aesthetics
aes(x=Temp, y=pH, colour = Country) +
# 4. Geoms
geom_line() +
geom_errorbar(aes(ymin=pH-sd_pH, ymax = pH+sd_pH), width = 0.2, alpha = 0.4)
The first plots we made were scatterplots with 2 continuous variables. With one discrete variable and one categorical variable, we can make a stripplot. We use the point geom for both of these types of plots.
# 1. Data
ggplot(latrines) +
# 2. Aesthetics
aes(x = Depth, y = OTUs) +
# 4. Geoms
geom_point()
Again, we have a lot of values crushed near the x-axis. If we add log scaling to the y-axis, we get an error that this transformation created infinite values. This is because we have zeros in our data set.
# 1. Data
ggplot(latrines) +
# 2. Aesthetics
aes(x = Depth, y = OTUs) +
# 3. Scaling
scale_y_log10() +
scale_x_continuous(breaks = c(1:12)) +
# 4. Geoms
geom_point()
Again we get a warning that has to do with the conversion of our values to log10(OTU) which will produce infite values for OTU == 0. Knowing this, we could ignore the warning, or add +1 to each OTU (negligable on a log scale).
# 1. Data
ggplot(latrines) +
# 2. Aesthetics
aes(x = Depth, y = OTUs + 1) +
# 3. Scaling
scale_y_log10() +
scale_x_continuous(breaks = c(1:12)) +
# 4. Geoms
geom_point()
In order to see the points a little better, we can 'jitter' them. Jittering spreads out our data points while keeping them in the same area of the plot so we can get an idea of density.
# 1. Data
ggplot(latrines) +
# 2. Aesthetics
aes(x = Depth, y = OTUs + 1) +
# 3. Scaling
scale_y_log10() +
scale_x_continuous(breaks = c(1:12)) +
# 4. Geoms
geom_point(position = "jitter") +
# 6. Facets
facet_wrap(~Country)
With bubble plots, we need to provide the size of our bubbles in a way that is proportional to our data, and provide a scale to map these to.
Let's group our OTUs by Country and Taxa and look at the Taxa for the top 30 OTUs. As a refresher, try to do this using dplyr functions.
We are going to make our bubbles (which can be thought of as large points) with geom_jitter() so that our bubbles don't overlap. Remember that we want to specify what we are plotting with aesthetics. We want the bubbles representing OTUs_per_Taxa, so we divide each value by pi when calling size.
# Make a data set of the top 30 Taxa with highest total OTUs in a single country
bubble_latrines <-
latrines %>%
group_by(Taxa, Country) %>% # group data by Taxa and Country
summarize(OTUs_per_Taxa = sum(OTUs)) %>% # Summarize the sum of the OTUs for each group
arrange(desc(OTUs_per_Taxa)) %>% # Sort the data
.[1:30,] # Keep the top 30 entries
bubble_latrines
# Plot the data but alter the size based on a variable like OTUs_per_Taxa
# 1. Data
ggplot(bubble_latrines) +
# 2. Aesthetics
aes(x = Country, y = OTUs_per_Taxa, fill=Taxa) +
guides(fill = "none") + # remove the legend for fill
# 3. Scaling
scale_y_log10() +
# 4. Geoms
geom_jitter(aes(size = OTUs_per_Taxa/pi), pch = 21, show.legend = TRUE)
However, it is hard to tell proportionally how different these sizes are. Mapping values to a scale gives us more of an idea of their relative sizes. This is done by giving a range of values to scale_size_continuous(). This range is the size you want your plotting symbols (bubbles) to be after transformation. You can use trial and error to see what range you like; it is not changing your data, but rather your perception of the data.
# 1. Data
ggplot(bubble_latrines) +
# 2. Aesthetics
aes(x = Country, y = OTUs_per_Taxa, fill=Taxa) +
guides(fill = "none") + # remove the legend for fill
# 3. Scaling
scale_y_log10() +
scale_size_continuous(range = c(1,30)) + # Alter the range of the sizes from the default applied in the previous graph
# 4. Geoms
geom_jitter(aes(size = OTUs_per_Taxa/pi), pch = 21, show.legend = TRUE)
ggplot2 provides 2 methods of labelling: geom_label() and geom_text(). Let's look at the help menu to find out what the difference between them is.
We have to specify with aesthetics what we wish to plot for our label.
# 1. Data
ggplot(bubble_latrines) +
# 2. Aesthetics
aes(x = Country, y = OTUs_per_Taxa, fill=Taxa) +
guides(fill = "none") + # remove the legend for fill
# 3. Scaling
scale_y_log10() +
scale_size_continuous(range = c(1,30)) + # Alter the range of the sizes from the default applied in the previous graph
# 4. Geoms
geom_jitter(aes(size = OTUs_per_Taxa/pi), pch = 21, show.legend = FALSE) +
geom_label(aes(label = Taxa), show.legend = FALSE) # Add labels from the Taxa variable but don't use a legend
This isn't fantastic because our labels are overlapping. What parameters might you be able to use to move the labels? Try using them to get the labels to not overlap.
The text geom has an option to check for overlap of labels. geom_text() does not provide a background for the label, and it may be harder to tell which label belongs to each data point.
# 1. Data
ggplot(bubble_latrines) +
# 2. Aesthetics
aes(x = Country, y = OTUs_per_Taxa, fill=Taxa) +
guides(fill = "none") + # remove the legend for fill
# 3. Scaling
scale_y_log10() +
# Alter the range of the sizes from the default applied in the previous graph
scale_size_continuous(range = c(1,30)) +
# 4. Geoms
# Update how we calculate the size by taking the square root of OTU/pi
geom_jitter(aes(size = sqrt(OTUs_per_Taxa/pi)), pch = 21, show.legend = FALSE) +
# Add labels from the Taxa variable but don't use a legend
geom_text(aes(label = Taxa), show.legend = FALSE)
What happened here? Why don't we have as many labels? Look in the help menu to explain this behaviour.
I don't like futzing with label positions, so I went looking for a package that would do this for me. ggrepel will 'repel' your labels away from each other without getting rid of them. Let's check it out with our bubble plot labels. Install and load ggrepel. The equivalent function we can use is geom_label_repel(). The force parameter allow you to modify how far you want your labels pushed away from each other.
Here, the box colour is being used to map to our data points but ggrepel can instead connect a line to data points by altering the value of segment.size. Variations on use can be found here: https://cran.r-project.org/web/packages/ggrepel/vignettes/ggrepel.html.
# Load up the ggrepel package
library(ggrepel)
# 1. Data
ggplot(bubble_latrines) +
# 2. Aesthetics
aes(x = Country, y = OTUs_per_Taxa, fill=Taxa) +
guides(fill = "none") + # remove the legend for fill
# 3. Scaling
scale_y_log10() +
# Alter the range of the sizes from the default applied in the previous graph
scale_size_continuous(range = c(1,30)) +
# 4. Geoms
# Update how we calculate the size by taking the square root of OTU/pi
geom_jitter(aes(size = sqrt(OTUs_per_Taxa/pi)), pch = 21, show.legend = FALSE) +
# Add labels from the Taxa variable but don't use a legend
geom_label_repel(aes(label = Taxa), force = 2, show.legend = FALSE, segment.size = 0)